
    Self-supervised automated wrapper generation for weblog data extraction

    Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but they remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds to derive a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach to derive the set of rules and automate the process of wrapper generation. An evaluation of the model conducted on a dataset of 2,393 posts shows 92% accuracy, indicating that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives.
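The feed-to-HTML matching step can be sketched as follows (a minimal Python illustration on toy markup; the class and function names are hypothetical, not from the paper): each feed value is searched for among the text nodes of several posts, and the element path that matches most consistently is taken as the extraction rule.

```python
from html.parser import HTMLParser
from collections import Counter

class PathRecorder(HTMLParser):
    """Records the tag/class path to every text node in a document."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_paths = []  # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.text_paths.append(("/".join(self.stack), data.strip()))

def derive_title_rule(posts):
    """posts: list of (feed_title, html) pairs from the same weblog.
    Returns the element path that most often holds the feed title."""
    votes = Counter()
    for feed_title, html in posts:
        recorder = PathRecorder()
        recorder.feed(html)
        for path, text in recorder.text_paths:
            if text == feed_title:          # feed value located in the HTML
                votes[path] += 1
    return votes.most_common(1)[0][0]

posts = [
    ("First post",  "<div class='post'><h2 class='title'>First post</h2><p>Hello</p></div>"),
    ("Second post", "<div class='post'><h2 class='title'>Second post</h2><p>World</p></div>"),
]
print(derive_title_rule(posts))  # div.post/h2.title
```

Matching over multiple posts rather than a single one is what makes the rule derivation robust: a path that coincidentally contains the title in one post is outvoted by the path that contains it in all posts.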

    Intelligent Self-Repairable Web Wrappers

    The amount of information available on the Web grows at an incredibly high rate. Systems and procedures devised to extract these data from Web sources already exist, and different approaches and techniques have been investigated in recent years. On the one hand, reliable solutions should provide robust Web data mining algorithms that can automatically cope with malfunctioning or failures. On the other hand, the literature lacks solutions for the maintenance of these systems. Procedures that extract Web data may be strictly interconnected with the structure of the data source itself; thus, malfunctioning or the acquisition of corrupted data can be caused, for example, by structural modifications of the data sources introduced by their owners. Nowadays, verification of data integrity and maintenance are mostly managed manually in order to ensure that these systems work correctly and reliably. In this paper we propose a novel approach to create procedures able to extract data from Web sources -- the so-called Web wrappers -- which can cope with malfunctioning caused by modifications of the structure of the data source and can automatically repair themselves.
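The self-repair idea can be sketched as follows (a hypothetical Python illustration, not the authors' implementation): the wrapper keeps values it extracted successfully in the past and, when its current rule stops matching after a structural change, relocates one of those known values in the modified page to rebuild the rule.

```python
import re

class SelfRepairingWrapper:
    """Minimal sketch (hypothetical names): regex-based extraction with
    automatic repair when the page structure changes."""
    def __init__(self, pattern, samples):
        self.pattern = pattern      # current extraction rule
        self.samples = samples      # values extracted successfully in the past

    def extract(self, page):
        m = re.search(self.pattern, page)
        if m:
            return m.group(1)
        return self._repair_and_retry(page)

    def _repair_and_retry(self, page):
        # Repair: locate a previously seen value in the modified page and
        # generalise its new enclosing markup into a fresh rule.
        for value in self.samples:
            i = page.find(value)
            if i == -1:
                continue
            prefix = page[:i].rsplit("<", 1)[-1]   # tag enclosing the value
            tag = prefix.split()[0].rstrip(">")
            self.pattern = rf"<{tag}[^>]*>([^<]+)</{tag}>"
            m = re.search(self.pattern, page)
            return m.group(1) if m else None
        return None

w = SelfRepairingWrapper(r"<span class='price'>([^<]+)</span>", samples=["19.99"])
print(w.extract("<div><span class='price'>19.99</span></div>"))  # 19.99 (rule still valid)
# The site redesign replaced <span> with <em>; the old rule fails, and the
# wrapper relocates a known value to rebuild its extraction rule:
print(w.extract("<div><em class='price'>19.99</em></div>"))      # 19.99 (after self-repair)
print(w.extract("<div><em class='price'>24.50</em></div>"))      # 24.50 (new rule in use)
```

Real wrappers operate on DOM trees and use more robust similarity measures than exact string matching, but the essential mechanism is the same: past output serves as the oracle for repairing the extraction rule.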

    Applying semantic web technologies to knowledge sharing in aerospace engineering

    This paper details an integrated methodology to optimise Knowledge reuse and sharing, illustrated with a use case in the aeronautics domain. It uses ontologies as a central modelling strategy for the capture of Knowledge from legacy documents, either via automated means or directly in systems interfacing with Knowledge workers through user-defined, web-based forms. The domain ontologies used for Knowledge capture also guide the retrieval of the Knowledge extracted from the data, using a Semantic Search System that provides support for multiple modalities during search. This approach has been applied and evaluated successfully within the aerospace domain, and is currently being extended for use in other domains on an increasingly large scale.

    Multiple Representations in Geographic Information Systems

    Geographic information systems (GIS) deal with data which can potentially be useful for a wide range of applications. However, the information needs of each application usually vary, especially in resolution, level of detail, and representation style. This thesis presents a set of primitives that allow the specification of operational processes, such as transformations between representations, through the use of a dynamic schema.
    Sociedad Argentina de Informática e Investigación Operativa

    Data-driven XPath generation

    The XPath query language offers a standard for information extraction from HTML documents. To this end, the DOM tree representation is typically used, which models the hierarchical structure of the document. One of the key aspects of HTML is the separation between data and the structure used to represent it. A consequence thereof is that data extraction algorithms usually fail to identify data if the structure of a document is changed. In this paper, we investigate how a set of tabular-oriented XPath queries can be adapted in such a way that it deals with modifications in the DOM tree of an HTML document. The basic idea is that if data has already been extracted in the past, it can be used to reconstruct XPath queries that retrieve the same data from a different DOM tree. Experimental results show the accuracy of our method.
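A minimal sketch of this data-driven idea (hypothetical names, toy XHTML; the paper's actual algorithm is more elaborate): previously extracted values are located in the modified DOM tree, and an XPath is rebuilt from the path to the matching node.

```python
import xml.etree.ElementTree as ET

def regenerate_xpath(old_values, new_html):
    """Relocate previously extracted values in a changed DOM tree and
    rebuild an XPath expression that reaches them."""
    root = ET.fromstring(new_html)
    # ElementTree has no parent pointers, so build a child-to-parent map.
    parent = {child: node for node in root.iter() for child in node}

    def path_to(node):
        steps = []
        while node is not None:
            steps.append(node.tag)
            node = parent.get(node)
        return "/" + "/".join(reversed(steps))

    for node in root.iter():
        if (node.text or "").strip() in old_values:
            return path_to(node)
    return None

# Data extracted before the layout change:
old_values = {"Alice", "Bob"}
# The same page after a template redesign (div/span instead of a table):
new_html = "<html><body><div><span>Alice</span></div></body></html>"
print(regenerate_xpath(old_values, new_html))  # /html/body/div/span
```

A production version would generalise from several matched nodes (adding predicates or positional steps) rather than returning the path of the first hit, but the principle is the one the abstract describes: old data reconstructs new queries.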

    Evolution in the number of authors of computer science publications

    This article analyses the evolution in the number of authors of scientific publications in computer science (CS). The analysis is based on a framework that structures CS into 17 constituent areas, proposed by Wainer et al. (Commun ACM 56(8):67–73, 2013), so that indicators can be calculated for each one in order to make comparisons. We collected and mined over 200,000 article references from 81 conferences and journals in the considered CS areas, spanning a 60-year period (1954–2014). The main insight of this article is that all CS areas witness an increase in the average number of authors in every decade, with just one slight exception. We ordered the article references by number of authors in ascending chronological order and grouped them into decades. For each CS area, we provide a perspective of how many groups (1-author papers, 2-author papers and so on) must be considered to reach certain proportions of the total for that CS area, e.g., the 90th and 95th percentiles. Different CS areas require different numbers of groups to reach those percentiles. For all 17 CS areas, an analysis of the point in time at which publications with n+1 authors overtake publications with n authors is presented. Finally, we analyse the average number of authors and their rate of increase. This work was supported by FCT - Fundação para a Ciência e Tecnologia within the Project Scope UID/CEC/00319/2013.
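The two descriptive statistics discussed above, average authors per decade and the year in which (n+1)-author papers overtake n-author papers, can be computed in a few lines of Python (toy records below, not the paper's dataset; the overtake criterion here uses cumulative counts, which may differ from the paper's exact definition):

```python
from collections import Counter

def authors_per_decade(records):
    """records: (year, n_authors) pairs. Returns {decade: mean authors}."""
    by_decade = {}
    for year, n in records:
        by_decade.setdefault(year // 10 * 10, []).append(n)
    return {d: sum(v) / len(v) for d, v in sorted(by_decade.items())}

def overtake_year(records, n):
    """First year in which cumulative (n+1)-author papers outnumber
    cumulative n-author papers."""
    counts = Counter()
    for year, k in sorted(records):
        counts[k] += 1
        if counts[n + 1] > counts[n]:
            return year
    return None

# Illustrative toy data: (publication year, number of authors)
records = [(1960, 1), (1961, 1), (1975, 2), (1976, 2), (1977, 2),
           (1990, 3), (1991, 3), (1992, 3), (1993, 3)]
print(authors_per_decade(records))   # {1960: 1.0, 1970: 2.0, 1990: 3.0}
print(overtake_year(records, 2))     # 1993
```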

    Dispersal syndromes in challenging environments: A cross‐species experiment

    Dispersal is a central biological process tightly integrated into life histories, morphology, physiology and behaviour. Such associations, or syndromes, are anticipated to impact the eco-evolutionary dynamics of spatially structured populations and cascade into ecosystem processes. As for dispersal on its own, these syndromes are likely neither fixed nor random, but conditional on the experienced environment. We experimentally studied how dispersal propensity varies with individuals' phenotype and local environmental harshness, using 15 species ranging from protists to vertebrates. We reveal a general phenotypic dispersal syndrome across the studied species: dispersers are larger, more active and have a marked locomotion-oriented morphology, and the link between dispersal and some phenotypic traits strengthens with environmental harshness. Our proof-of-concept metacommunity model further reveals cascading effects of context-dependent syndromes on the local and regional organisation of functional diversity. Our study opens new avenues to advance our understanding of the functioning of spatially structured populations, communities and ecosystems. Keywords: context-dependent dispersal; dispersal strategy; distributed experiment; predation risk; resource limitation.

    Resilience trinity: Safeguarding ecosystem functioning and services across three different time horizons and decision contexts

    Ensuring ecosystem resilience is an intuitive approach to safeguard the functioning of ecosystems and hence the future provisioning of ecosystem services (ES). However, resilience is a multi‐faceted concept that is difficult to operationalize. Focusing on resilience mechanisms, such as diversity, network architectures or adaptive capacity, has recently been suggested as a means to operationalize resilience. Still, the focus on mechanisms is not specific enough. We suggest a conceptual framework, the resilience trinity, to facilitate management based on resilience mechanisms in three distinctive decision contexts and time horizons: 1) reactive, when there is an imminent threat to ES resilience and a high pressure to act; 2) adjustive, when the threat is known in general but there is still time to adapt management; and 3) provident, when time horizons are very long and the nature of the threats is uncertain, leading to a low willingness to act. Resilience has different interpretations and implications at these different time horizons, which also prevail in different disciplines. Social ecology, ecology and engineering often implicitly focus on provident, adjustive or reactive resilience, respectively, but these different notions of resilience and their corresponding social, ecological and economic tradeoffs need to be reconciled. Otherwise, we keep risking unintended consequences of reactive actions, or shying away from provident action because of uncertainties that cannot be reduced. The suggested trinity of time horizons and their decision contexts could help ensure that longer‐term management actions are not missed while urgent threats to ES are given priority.